Improved Methods for Early Fault Detection in Enterprise Computing Servers Using SAS Tools
Abstract
Advanced telemetry systems are being developed to collect and archive hundreds of system performance, throughput, quality-of-service (QoS), and physical variables for the purpose of enhancing the reliability, availability, serviceability, scalability, and security of business-critical enterprise computing servers. SAS software was chosen for this project because of the language's powerful coding features and its ability to model prototype systems quickly and cost-effectively. The SAS macro language was exploited in a parametric sensitivity study to explore and optimize clusters of signals that provide high sensitivity for early fault detection while still avoiding false alarms, for subsequent post-processing with an advanced statistical pattern-recognition surveillance system called MSET. Finally, PROC GPLOT and PROC G3D proved indispensable for displaying the results of our many-variable sensitivity investigation. The SAS programs developed for this investigation are generic rather than operating-system specific. The results presented in this paper were obtained with SAS 8.2 for Unix environments running on Solaris 8 platforms. The coding practices discussed in this paper are aimed at users with average SAS/GRAPH experience and average proficiency with the SAS macro language.
Introduction
Fault detection in complex systems typically requires costly on-line monitoring and expertise. Conventional approaches to identifying faults, which combine event correlation and threshold-based rules, have proven inadequate in a variety of safety-critical industries with complex, heterogeneous subsystem inputs not dissimilar to those from enterprise computing. Fundamentally, while many high-end computing servers are already rich in instrumentation, the data produced by that instrumentation are complex, non-uniform, and difficult to correlate.
Pattern recognition technology, coupled with continuous system telemetry, offers key enabling technology that can:
• identify incipient faults proactively
• reduce the expertise required to identify correlations
• reduce the compute cost of monitoring
• reduce the probability of false alarms
• increase the accuracy and timeliness of root cause analysis
• help eliminate “No-Trouble-Found” (NTF) events that can drive up warranty and serviceability costs for a server vendor
The effectiveness of using pattern recognition, coupled with continuous system telemetry, to discern incipient faults in noisy process data is gated by the quality of information available from instrumentation. This investigation was undertaken to extract, pre-process, and provide analytical resampling for a vast variety of performance, quality-of-service, and system load metrics for
SUGI 29 Posters
Similar resources
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one example of such codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
"Ins" and "Outs" of Installing and Configuring the SAS® Enterprise BI Server at Blue Cross & Blue Shield of Minnesota
This paper discusses the implementation of SAS Enterprise BI Server at BlueCross BlueShield of Minnesota (BCBSM). It provides an overview of the hardware and software architecture and the deployment of SAS Enterprise BI Server within a mature enterprise-wide and external web-facing infrastructure in a multi-tier UNIX environment. This paper also provides highlights of the installation and confi...
Proactive Fault Monitoring in Enterprise Servers
New proactive fault monitoring innovations are being developed, demonstrated on executing servers, and productized for enhancing the reliability, availability, and serviceability of enterprise-class servers. A continuous system telemetry harness (CSTH) has been developed that collects time series signals relating to the health of dynamically executing servers. These time series provide quantita...
A Practical Approach to Re-Architecting a SAS Deployment
In today's business environment, enterprise computing deployments must be able to handle the challenges that companies face while adhering to IT standards. With the SAS platform being a multi-tiered environment consisting of components residing on client desktops, middle-tier Web servers, compute servers, and data assets, SAS customers are looking for ways to modernize their environment without...
Detection of high impedance faults in distribution networks using Discrete Fourier Transform
In this paper, a new method for extracting dynamic properties for High Impedance Fault (HIF) detection using discrete Fourier transform (DFT) is proposed. Unlike conventional methods that use features extracted from data windows after fault to detect high impedance fault, in the proposed method, using the disturbance detection algorithm in the network, the normalized changes of the selected fea...
Publication date: 2004